Abstract

The performance of the emerging petaflops-scale
supercomputers of the near future (hypercomputers)
will be governed not only by the clock frequency of the
processing nodes or by the width of the system bus, but
also by such factors as the overall power consumption
and the geometric size. In this paper, we study the
influence of such parameters on one of the most
important characteristics of a general-purpose computer:
the degree of multithreading that must be present
in an application to make the use of the hypercomputer
justifiable. Our major finding is that, for the class of
applications with purely random memory access
patterns, "super-fast computing" and "high-performance
computing" are essentially synonyms for
"massively-parallel computing".



1. Introduction 

Super-fast computers processing data at a sustained
rate on the order of $10^{15}$ integer or floating-point
operations per second (1 petaops, or 1 petaflops), also
known as hypercomputers [9], will emerge within the next
decade as the ultimate tools for solving very large-scale
problems of computational fluid dynamics, weather
forecasting, nuclear stockpile stewardship, cryptanalysis,
real-time image processing and rendering, and the
like [3].

Common sense, supported by the results of preliminary
case studies [11], suggests that hypercomputers
will materialize as hardware installations of substantial 
size and power consumption. The average geometric 
diameter of the installation, combined with the ultra- 
high clock frequency, will eventually translate into
a memory access latency of hundreds to thousands of
processor cycles, a situation unthinkable in
the domain of personal computers but quite common



on the Internet. To achieve and sustain the required 
performance, the hypercomputer must be originally 
designed as a highly multithreaded machine [TU1 H]- 
Preemptive multithreading helps to hide the memory 
access latency. However, it implies high parallelism, 
which inevitably limits the usability of a hypercomputer
to a narrow domain of intrinsically parallel applications.
Careful consideration of physical factors can help to
anticipate the potential problems that may doom
the design of a hypercomputer to failure.

In this paper, we obtain a rough parametric estimate
of the performance of hypercomputers based
on their fundamental physical and geometric properties,
such as power consumption and wire size.

2. Model 

For the purpose of this study, the following simplified
model of a hypercomputer has been used. We assume
that the hypercomputer consists of $Q$ nodes, each
node being either a processing element (PE) or a memory
bank. The nodes are connected using a multistage
internal network. The diameter of the network $D$ is on
the order of $\log_2 Q$ (this is true for delta networks and
approximately true for other high-performance networks).
For ease of application development, all
processing elements have uniform access to the globally
shared memory. A typical application using the
hypercomputer generates purely random memory traffic at a
rate of 1.32 ("load") requests and 0.78 replies ("store")
per clock cycle [7], or approximately 1 outbound message
per cycle per node. All instructions are presumed
to be fetched from local instruction caches and do not
contribute to the total traffic. Data caches are not
considered, given the random pattern of the
memory usage. Finally, we assume that the processor
word width is $W$ bits, the processor clock frequency
is $f_0$, and that each PE completes one instruction per
clock cycle.
Figure 1. Arrangement of the components of
a hypercomputer. A sample message path is 
shown with a dashed line. 



3. Power Consumption 

Electrical power is consumed by the hypercomputer 
statically and dynamically. The static term is at- 
tributed to the leakage current (which can be ignored, 
at least in theory) and to the power dissipation in pas- 
sive interconnecting wires. The dynamic term depends 
on the performance of the hypercomputer. 

Let us begin with an evaluation of the total number
of wires $N$ required to interconnect the processing
elements. The signal transfer rate on a wire (limited by
the wire bandwidth $B_w$) may be substantially lower
than the PE clock rate $f_0$. Accordingly, the number
of passive wires in the network must be proportionally
larger, so that the available bandwidth of the network
matches the total bandwidth $B = Q W f_0$ of the requests
generated by the PEs. Each stage of a multistage network
contributes proportionally to the total number of wires,
too. Finally, we must add extra wires to compensate
for network saturation, which typically takes place
at a load of ~60%:

$$N = \frac{D}{0.6}\,\frac{B}{B_w} = \frac{D\,Q\,W f_0}{0.6\,B_w}. \qquad (1)$$

To achieve its ultimate performance, the system
must be well balanced in the sense that the round-trip
memory access latency, measured in PE clock cycles,
should be approximately equal to the degree of
multithreading. In this case, a thread blocked at a memory
request will be scheduled for execution by the hardware
exactly when the results arrive at the local registers.
A smaller degree of multithreading will reduce the
performance of the hypercomputer, while a higher degree
will require extra hardware for thread contexts, most
of which will never be used.
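
The balance condition can be illustrated with a toy calculation. The sketch below is not part of the model in the text: it assumes an idealized PE in which each thread issues one instruction and becomes ready again a fixed number of cycles later, and the function name `pe_utilization` is hypothetical.

```python
def pe_utilization(threads: int, latency_cycles: int) -> float:
    """Fraction of clock cycles in which an idealized PE has a ready thread.

    Assumption (illustrative only): every thread issues one instruction,
    then waits `latency_cycles` cycles in total before it can issue again.
    """
    if threads <= 0 or latency_cycles <= 0:
        raise ValueError("threads and latency_cycles must be positive")
    # With fewer threads than latency cycles, some cycles stay idle;
    # with more, the extra thread contexts add nothing.
    return min(1.0, threads / latency_cycles)
```

Under these assumptions, halving the number of thread contexts relative to the latency halves the utilization, while doubling it leaves the utilization at 1 and only adds unused contexts.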

As a first step toward refining the proposed
model, we observe that the design of a petaflops-scale
hypercomputer implies three-dimensional integration.
Indeed, it has been shown [2] that the footprint of
a hypercomputer flattened into two-dimensional
space would be as large as a soccer field (namely,
~1,000 m$^2$). The actual arrangement of the components
(PEs, memory banks, and internal network
nodes) is not essential for the study. We will focus on
a rather unrealistic, but easy to model, spherical
configuration, with all active components evenly placed
on the surface of a sphere of diameter $L$, and all passive
components (wires) hidden under the surface, as
shown in Figure 1. (A similar, but technically more
sound, cylindrical arrangement has been proposed
in [1] and [12].) Such a configuration permits relatively
easy access to the active components in case they need
maintenance or replacement.
Second, we must establish a relationship between
the performance of the hypercomputer and its
configuration and clock frequency. The aggregate peak
performance of the hypercomputer, measured in floating
point operations per second, can be roughly estimated
as

$$\Theta = Q f_0 \left(W/W_0\right), \qquad (2)$$

where $W_0$ is the number of bits per word in a "standard"
processing element. The ratio in the parentheses
takes into account the fact that, for instance, a 128-bit
PE is twice as powerful as its 64-bit counterpart
running at the same clock rate.

The size of the hypercomputer installation will be
defined by the amount of power $p_v$ that can be
removed from a unit volume by means of either forced-air
or water cooling. The maximum power that can be
removed from a unit area in the former case is
$p_s = 5 \cdot 10^5$ W/m$^2$ [5].
Water cooling can remove more power, but requires
more sophisticated and bulky plumbing. At the moment,
we do not know what the ultimate vertical
chip pitch $h$ for 3-dimensional integration will be. A
pitch of $h = 5$ mm seems a reasonable approximation,
with a proper allowance for the packaging and cooling
infrastructure. Under this assumption, the maximum
power that can be removed from a unit volume
is $p_v = p_s/h = 10^8$ W/m$^3$.



3.1. "Test Vehicle" Hypercomputer 



3.3. Dynamic Power Dissipation 



To verify our theoretical reasoning, we will consider
a hypothetical hypercomputer of the year 2007. This "Test
Vehicle" hypercomputer (TVHC) will be driven by
$Q = 50{,}000$ super-fast 128-bit Intel chips ($f_0 = 20$ GHz).
The nodes will be connected using a banyan network
($D = \log_2 Q \approx 16$) implemented as a collection of
insulated thin pure copper wires (bandwidth per wire
$B_w \approx 3.6$ Gbps [5]; resistivity
$\rho = 17.5 \cdot 10^{-9}$ $\Omega\cdot$m; wire
electrical cross-section $\sigma_w = 2.5 \cdot 10^{-8}$ m$^2$). One can
verify using Eq. (2) that the peak performance of this
hypercomputer will be $10^{15}$ operations per second, or
1 petaops.
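
As a quick numerical check of the TVHC parameters, the sketch below evaluates the peak performance of Eq. (2) and the wire count described in Section 3, read here as $N = D Q W f_0 / (0.6 B_w)$. The value `W0 = 128` is an assumption chosen so that Eq. (2) returns the quoted $10^{15}$ operations per second; the function names are hypothetical.

```python
import math

Q = 50_000        # number of nodes
W = 128           # PE word width, bits
W0 = 128          # "standard" word width, bits (assumed, see lead-in)
F0 = 20e9         # PE clock frequency, Hz
B_W = 3.6e9       # bandwidth per copper wire, bit/s
SATURATION = 0.6  # load at which the network saturates

def peak_performance(q, f0, w, w0):
    """Aggregate peak performance in operations per second, Eq. (2)."""
    return q * f0 * w / w0

def network_diameter(q):
    """Number of stages, log2(Q) rounded up to a whole stage."""
    return math.ceil(math.log2(q))

def wire_count(q, f0, w, d, b_w, saturation=SATURATION):
    """Wires needed to carry Q*W*f0 bit/s through d stages below saturation."""
    return d * q * w * f0 / (saturation * b_w)

theta = peak_performance(Q, F0, W, W0)
D = network_diameter(Q)
N = wire_count(Q, F0, W, D, B_W)
print(f"theta = {theta:.2e} op/s, D = {D}, N = {N:.2e} wires")
```

With these inputs the network needs on the order of $10^9$ wires, which is the figure that drives the size estimates in the rest of the paper.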

3.2. Static Power Dissipation 

Power dissipated statically by a passive resistive
electrical system is given by Ohm's law: $P_s = I^2 R$,
where $I$ is the signal current, and $R$ is the overall
resistance of the system. We assume that $I \approx 20$ mA,
although higher-current drivers may be needed to sustain
error-prone high bit rate transmission at meter-scale
distances.

The interconnection network can ultimately be considered
as a collection of $N$ individual wires of length
$l_i$, with electrical cross-section $\sigma_w$, made of a good
conductor with resistivity $\rho$. It can be shown that the
average distance between any two components on a
sphere of diameter $L$ is $2L/\pi$. The wires are connected
in series, and the total resistance is:

$$R = \sum_{i=1}^{N} R_i = \frac{\rho}{\sigma_w} \sum_{i=1}^{N} l_i .$$

Finally,

$$P_s = \frac{2 I^2 \rho N L}{\pi \sigma_w}. \qquad (3)$$



The total heat generated by the static power dissipation
$P_s$ must be removed from the chip, according
to the conditions stated above. This is only possible
if the volume occupied by the wiring is large enough:
$P_s < p_v V = \pi p_v L_{st}^3/6$. Substituting $P_s$ from Eq. (3) and
$N$ from Eq. (1) and Eq. (2), we get the final dependence
of $L_{st}$ on $\Theta$:



Dynamic power dissipation is due to the fact that
each operation executed by any PE requires a certain
energy (in our case, $w \approx 10^{-10}$ J/op [8]). We have to
consider the heat generated by both processing elements
and memories (there are $2Q$ of them), and switching
elements (there are at least $QD/2$ of them, assuming a
delta-class interconnection network). We do not know
the exact relationship between the complexity of
operations executed by the switching engines and
computational engines, and for the purpose of this study
we will assume that they are equivalent. Therefore,
the total dynamic power dissipation $P_d$ in the
hypercomputer is equal to $\Theta w (2 + D/2)$. According to the
model proposed in Sec. 2, active processing and switching
elements are spread on the surface of the sphere
enclosing the passive interconnection wires, forming
an "active shell". The surface of the sphere must
be spacious enough to enable adequate heat removal:
$\Theta w (2 + D/2) < \pi L_{dyn}^2 p_s$. Obviously,

$$L_{dyn} = \sqrt{\frac{\Theta w (2 + D/2)}{\pi p_s}}. \qquad (5)$$



For the TVHC, $L_{dyn} \approx 0.8$ m. This is certainly an
optimistic estimate, because a lot of power is required
for various support operations, such as PE "housekeeping"
and memory refreshing.
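
The TVHC figure can be reproduced from the surface heat-removal condition above. The following is a minimal sketch using the values quoted in the text ($\Theta = 10^{15}$ op/s, $w = 10^{-10}$ J/op, $D = 16$, $p_s = 5 \cdot 10^5$ W/m$^2$); the function name is hypothetical.

```python
import math

def dynamic_core_diameter(theta, w_op, d, p_s):
    """Smallest sphere diameter whose surface can shed the dynamic power.

    theta: operations per second; w_op: energy per operation, J;
    d: number of network stages; p_s: removable power per unit area, W/m^2.
    """
    p_dyn = theta * w_op * (2 + d / 2)   # PEs, memories and switches
    return math.sqrt(p_dyn / (math.pi * p_s))

L_dyn = dynamic_core_diameter(theta=1e15, w_op=1e-10, d=16, p_s=5e5)
print(f"{L_dyn:.2f} m")  # prints 0.80 m
```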

3.4. Power Dissipation in Drivers 

Yet another source of dynamic power consumption
is the set of drivers responsible for the transmission
of digital signals from one agent to another along the
interconnecting wires. Each driver constitutes a current
source injecting either $+I$ or $-I$ into the attached
wire, at voltage $U$. To reduce noise and decrease the bit
error rate, the drivers must be placed as close to the
agents as possible, and are therefore located on the
same surface of the "active core". Altogether, $2N$
drivers are required, with a total power dissipation
of $P_{dr} = 2NIU$. Again, the surface of the core must
be spacious enough: $2NIU < \pi L_{dr}^2 p_s$. Naturally,

$$L_{dr} = \sqrt{\frac{2NIU}{\pi p_s}}. \qquad (6)$$

Under the assumption of a really low-voltage driver
($U = 1$ V), the diameter of the "thermal core" expands
to $L_{dr} \approx 4.9$ m. For comparison, the diameter of the
"static thermal core" for the TVHC is $L_{st} \approx 0.008$ m.
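
The driver-limited diameter can be checked the same way. This sketch assumes the wire count is $N = D Q W f_0 / (0.6 B_w)$ with the TVHC parameters, a driver current of 20 mA and a driver voltage of 1 V; the function name is hypothetical.

```python
import math

def driver_core_diameter(n_wires, current, voltage, p_s):
    """Smallest sphere diameter whose surface can shed the driver power.

    Two drivers per wire, each dissipating current * voltage watts.
    """
    p_drivers = 2 * n_wires * current * voltage
    return math.sqrt(p_drivers / (math.pi * p_s))

# Wire count for the TVHC (assumed form, see lead-in).
n_wires = 16 * 50_000 * 128 * 20e9 / (0.6 * 3.6e9)
L_dr = driver_core_diameter(n_wires, current=20e-3, voltage=1.0, p_s=5e5)
print(f"{L_dr:.1f} m")
```

The result agrees with the 4.9 m quoted above, which supports reading the wire count in this form.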

The diameters of all three thermal spheres considered
so far, given by Eq. (4), Eq. (5), and Eq. (6), scale as
$\sqrt{\Theta}$. This means, in particular, that the size of the
shell will be determined by static, dynamic, or
driver-related power dissipation, but not by all of them at a
time. More specifically,
(7)



To summarize: the size of the "minimal thermal
core" of the hypercomputer suggested in Subsection 3.1
must conform to the driver power dissipation
requirements. The surface of the conforming core will be large
enough to accommodate the processing and switching
elements, and the volume of the core will be large
enough to fit the interconnection wires, without
introducing additional power constraints.

4. Wiring Constraints 

Alternatively, the ultimate size of a hypercomputer 
can be estimated by considering how much space is 
required to contain the copper wires constituting the 
interconnection network. 

If the cross-section of a single interconnecting wire
(including appropriate insulation, cooling, mechanical
support, etc.) is $\sigma$, and there is a total of $N$ wires
constituting the interconnection network, then the
total physical volume $V_1$ occupied by the wiring is:

$$V_1 = \sigma N \bar{l} = 2 \sigma N L_g / \pi,$$

where $\bar{l} = 2L_g/\pi$ is the average wire length.
On the other hand, this volume cannot exceed the volume
of the core:

$$V_2 = \pi L_g^3 / 6.$$

Therefore, the following simple equation holds:

$$L_g = \sqrt{12 \sigma N} / \pi. \qquad (8)$$




For interchip connections implemented on a printed circuit
board (PCB), $\sigma$ may be chosen to be on the order
of $10^{-7}$ m$^2$ (wires are placed at ≈ 0.3 mm pitch).



Substituting Eq. (1) into Eq. (8), we obtain the
dependence of the average network size on the PE clock
frequency:

$$L_g = \sqrt{\left(\frac{12\,\sigma D}{0.6\,\pi^2 B_w}\right) f_0 W Q}. \qquad (9)$$



Notice that the parameters in the parentheses are be- 
yond our control. (D is a slow function of N and can 
be considered a constant.) 

Combining Eq. (2) and Eq. (9), we discover that the
average "packing" size of the interprocessor network
again scales as the square root of the performance of
the hypercomputer:

$$L_g = \sqrt{\left(\frac{12\,\sigma D\,W_0}{0.6\,\pi^2 B_w}\right) \Theta}. \qquad (10)$$



We would like to emphasize that Eq. (10) has been
obtained exclusively by considering the geometric volume
necessary to contain the passive interconnecting wires.

The "packing" size of the hypercomputer given by
Eq. (10) is almost 9.8 m.
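
The order of magnitude of the packing size can be checked with a short calculation. This sketch assumes a wire count of $N = D Q W f_0 / (0.6 B_w)$ and takes the effective cross-section as the square of the 0.3 mm pitch; both choices are assumptions, and the function name is hypothetical.

```python
import math

def packing_diameter(n_wires, sigma):
    """Diameter of the sphere whose volume just holds the wiring, Eq. (8).

    n_wires: number of wires; sigma: effective cross-section per wire, m^2.
    """
    return math.sqrt(12 * sigma * n_wires) / math.pi

# TVHC wire count (assumed form) and cross-section from the 0.3 mm pitch.
n_wires = 16 * 50_000 * 128 * 20e9 / (0.6 * 3.6e9)
sigma = (0.3e-3) ** 2
L_g = packing_diameter(n_wires, sigma)
print(f"{L_g:.1f} m")
```

With these inputs the diameter comes out close to 10 m, in the same range as the figure quoted above; the exact value depends on the cross-section assumed for a wire together with its insulation and support.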

5. Parallelism 

The "well-balanced" condition postulated earlier
imposes even stricter requirements on the scaling of a
hypercomputer. The net effect of the geometry of the
system on the expected degree of parallelism will be
discussed in this section.

In a "well-balanced" system, the number of thread
contexts per PE (or the amount of parallelism, $T$) must
be large enough to tolerate the round-trip latency of a
memory access measured in PE clock cycles. The latency
includes the signal propagation time $\tau_p$, the message
processing overhead $\tau_n$, and the memory response time $\tau_m$:

$$T = (\tau_p + \tau_n + \tau_m) f_0 = \left(\frac{2LD}{c_s} + \frac{DC}{f_0} + \tau_m\right) f_0. \qquad (11)$$

Here, $c_s$ is the signal propagation speed (in copper,
$c_s \approx 9 \cdot 10^7$ m/s), and $C$ is the number of PE cycles
required for message processing at one internal network
node (we take $C \approx 10$, but believe that it may be as
low as 1). It can be shown that for the hypercomputer
proposed above, the first term dominates the other two.
Indeed, $\tau_p \approx 2.25$ μs, $\tau_n \approx 5$ ns (the first and the
second terms in Eq. (11)), and $\tau_m \approx 1$ ns. For the rest
of our reasoning, we may safely assume that

$$T \approx \frac{2 L D f_0}{c_s}. \qquad (12)$$



The comparison of Eq. (7) and Eq. (10) with respect
to "our" hypercomputer suggests (Figure 2)
that the geometric considerations dominate the
power-management considerations, regardless of the
performance of the installation. Therefore, the study of the
power consumption may be safely omitted, and we can
concentrate on the geometric term.

The combination of Eq. (10) and Eq. (12) gives the
dependence of $T$ on the hypercomputer clock frequency
and overall performance:

$$T = \left(\frac{2D}{c_s}\sqrt{\frac{12\,\sigma D\,W_0}{0.6\,\pi^2 B_w}}\right) f_0 \sqrt{\Theta}. \qquad (13)$$



For the TVHC, $T \approx 70{,}000$. As usual, the factors
collected in the parentheses are beyond our control.
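
The required degree of multithreading follows directly from the propagation term of the latency. The sketch below evaluates $T \approx 2 L D f_0 / c_s$ for the TVHC, taking the 9.8 m packing size as the diameter; the function name is hypothetical.

```python
def required_threads(diameter, stages, f0, c_s):
    """Thread contexts per PE needed to hide the propagation delay.

    diameter: installation diameter, m; stages: network stages;
    f0: PE clock frequency, Hz; c_s: signal propagation speed, m/s.
    """
    return 2 * diameter * stages * f0 / c_s

T = required_threads(diameter=9.8, stages=16, f0=20e9, c_s=9e7)
print(f"{T:,.0f} threads per PE")
```

The result is just under 70,000 thread contexts per PE, matching the estimate above.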
Figure 2. The minimal diameter of the TVHC
installation as a function of its performance. 



6. Solutions 

An unpleasant consequence of Eq. (13) is
that the amount of intrinsic parallelism required from
an application in order to be efficiently executed by a
hypercomputer is proportional to the clock frequency
of the PE and to the square root of the overall
performance of the machine. This means that "super-fast
computing" and "high-performance computing" are
essentially synonyms for "massively-parallel computing",
and as such cannot be considered suitable for
general-purpose applications with a purely random memory
access pattern.

A number of solutions may be suggested to this
problem. One way to circumvent the "packing"
constraint is to use open-space optical interconnects. For
this kind of link, one can expect a bandwidth of
$B_o \approx 40$ Gbps per link, with signal propagation
speed $c_s = 3 \cdot 10^8$ m/s. An important property of an
open-space network is that the links can actually overlap.
Therefore, the size of the core will no longer be limited
by volume. Instead, it will be limited by the
area of the inner surface of the shell:



Here, $\sigma_{LE}$ is the footprint of a light-emitting
element, for instance, a vertical-cavity surface-emitting
laser (VCSEL). Assuming that the size of a VCSEL
is 200 μm × 200 μm [6], the diameter of the shell $L_g$
will be ≈ 2 m, a big improvement compared to the
"copper" shell. It is also worth mentioning that the
static power dissipation in an open-space network is
zero, due to the absence of wires.



We could not find reliable information on the power
consumption of very high-speed VCSELs and photodiodes.
An educated guess is that at 40 Gbps, the power
required by a single emitter is ~0.1 mW. Eq. (6)
gives the size of the "driver core": $L_{dr} \approx 3.3$ m. As
one can see, the "driver" shell becomes bigger than the
"packing" shell and determines the size of the TVHC.
Once again, we would like to emphasize that we have
no solid numbers for very high-speed VCSELs, and the
result of this calculation must be considered exclusively
as a rough estimate.

There exists at least one more alternative to copper
wires. They can be replaced with high-speed ballistic
high-$T_c$ superconductor (HTSC) ceramic wires.
HTSC wires promise high data transfer rates
($B_w \approx 10$ Gbps) and high signal propagation speed
($c_s \approx 2 \cdot 10^8$ m/s). These two factors together can reduce the
"packing" size and the degree of parallelism by 40% and
60%, respectively. However, the ultimate cross-section
of ceramic wires is not known yet, and this third factor
may potentially undo the improvement. There will
still be at least some gain, unless the HTSC wires are
$6 \cdot 10^{-7}$ m$^2$ in cross-section or thicker.
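
The 40% figure for the packing size can be reproduced from the square-root scaling alone. This sketch covers only the bandwidth effect and assumes the wire cross-section, network depth and traffic are unchanged, so that the packing size scales as $1/\sqrt{B_w}$; the function name is hypothetical.

```python
import math

def packing_size_ratio(b_old, b_new):
    """Ratio of new to old packing size when only wire bandwidth changes.

    Assumes the packing size scales as 1/sqrt(bandwidth per wire).
    """
    return math.sqrt(b_old / b_new)

ratio = packing_size_ratio(b_old=3.6e9, b_new=10e9)  # copper -> HTSC
reduction = 1 - ratio
print(f"packing size reduced by {reduction:.0%}")
```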

The biggest improvement that can be brought
by the HTSC wires is the shrinkage of the "driver"
core. Superconductor drivers may consume as little as
10 μW of power, compared to 20 mW for semiconductor
drivers. This would reduce the size of the respective
core to $L_{dr} \approx 0.5$ m, which would allow us to
exclude it from consideration entirely.

Unfortunately, HTSC wires can operate only at the
temperature of liquid nitrogen and require deep
refrigeration. The dissipated power will be removed
elsewhere (namely, at the nitrogen liquefier setup, which
may be located outside of the shell) and will not
contribute to the power balance of the core. However, the
cryogenic infrastructure may (and apparently will)
inflate the effective cross-section $\sigma$ of the interconnects.
The net effect of this inflation is not known yet.

7. Conclusion 

We have considered the parametric dependences of
the geometric size of a hypothetical petaflops-scale
hypercomputer on the geometric size and power properties
of its interconnection network. We discovered that
the size of a hypercomputer with a spherical arrangement
of active components (processing and switching
elements and memories) scales as the square root of
the aggregate peak performance: $L \sim \sqrt{\Theta}$. In order to
sustain the execution rate, the hypercomputer must be
designed as a highly multithreaded machine. As such,
it will be most suited for highly parallel applications.



Even though it may be possible to reduce the degree 
of parallelism by optimizing the implementation of the 
network, it is questionable whether a general-purpose 
application with purely random memory access pattern 
can benefit from being executed by the hypercomputer. 